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The performance of a number of different measures of non- 
linearity in a time series is compared numerically. Their power 
to distinguish noisy chaotic data from linear stochastic surro- 
gates is determined by Monte Carlo simulation for a number 
of typical data problems. The main result is that the rat- 
ings of the different measures vary from example to example. 
It seems therefore preferable to use an algorithm with good 
overall performance, that is, higher order autocorrelations or 
nonlinear prediction errors. 
PACS: 05.45. +b 



I. INTRODUCTION 

The theory of nonhnear, deterministic dynamical sys- 
tems provides powerful theoretical tools to characterize 
geometrical and dynamical properties of the attractors 
of such systems. Alongside the theoretical understand- 
ing of these systems, many of the typical phenomena have 
been realized in laboratory experiments. Many attempts 
have been made to detect behavior characteristic for de- 
terministic systems also in field data, that is, time series 
recordings of real world phenomena. Not surprisingly, 
the coarse nature of these time series (finite number of 
points with finite resolution) makes it difficult to obtain 
unambiguous results. As a particular example, it has 
been pointed out [Q that linear stochastic processes with 
long range autocorrelations can lead to spuriously small 
estimates of the attractor dimension. (See also the dis- 
cussion in Ref. Q.) The method of surrogate data ||] 
provides a rigorous statistical test for the null hypothesis 
that the data has been generated by a linear stochas- 
tic process. If this null hypothesis cannot be rejected, 
the results of a nonlinear analysis have to be regarded 
as spurious. In such a test, the value of some measure 
of nonlinearity is compared for the data and a number 
of randomized samples, the surrogates. The nonlinearity 
measure should be sensitive to the kind of nonlinearity 
suspected in the data and it should be possible to esti- 
mate its value with low variance. In this paper we will 
numerically compare the performance of a selection of 
measures which have been proposed in the literature. 

Apart from the mere detection of nonlinearity, nonlin- 
ear observables can be used to discriminate between dis- 
tinct states of a system on the base of time series data. 
Most notably, claims have been made that measures de- 
rived from chaos theory are able to distinguish healthy 
patients from those with pathological biological rhythms, 
for example cardiac arrhythmiae pHq]. The results pre- 



sented in this paper are also of relevance for the question 
of the preferable discriminating statistic in such a con- 
text. The most striking observation is that although the 
simplest observables, notably simple prediction errors, 
show good overall performance, results differ immensely 
from application to application, which may explain the 
partially contradicting claims in the literature. If enough 
data is available to be split into a training set and a test 
set, and if a model for a reasonable alternative hypothe- 
sis can be constructed, then optimization of the test on 
typical data may be worthwhile. 



II. TESTING FOR NONLINEARITY WITH 
SURROGATE DATA 

Currently, the most general null hypothesis we know 
how to test against is that the data was generated by 
a stationary Gaussian linear stochastic process, maybe 
measured through an instantaneous measurement func- 
tion 10]. Deviations from this null hypothesis can be de- 
tected by computing some nonlinear observable on the 
data. Since the probability distributions of such observ- 
ables are generally not known analytically, they must be 
estimated by Monte Carlo resampling of the data. For 
this purpose one generates random data sets (surrogates) 
which conserve those properties of the data which are ir- 
relevant for a given choice of the null hypothesis. For the 
hypothesis of a Gaussian linear stochastic process, the 
data and the surrogates must have the same autocor- 
relation function or, equivalently, the same power spec- 
trum. For a nonlinearity test allowing for simple rescal- 
ings, also the single time probability distribution must 
be conserved. A (nonlinear) observable t = t{{xn}) is 
estimated on the original data and all of the B sur- 
rogates {xfj}, k = 1, . . . ,B. The distribution of t can be 
estimated from the values tk — t({x^}). One can then 
test at a given level of significance for the assumption 
that ^0 = tii^n}) drawn from the same distribution. 
If this assumption is rejected, the original data {x"} is 
taken to be different from the linear surrogates and is 
thus considered to be nonlinear at this level of signifi- 
cance. 

The use of surrogate data has been promoted in the 
context of chaotic time series in Ref. Q. Although the 
technique has made distinguishing chaos from noise much 
safer, some caveats remain. These will not be discussed in 
this paper, Refs. |P-p^ provide noteworthy material. We 
will throughout use examples where the known problems 
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(nonstationarity, long coherence times) are of no concern. 

There are two important parameters which character- 
ize the performance of a statistical test. One is its size 
a, which is the probability that the null hypothesis is 
rejected although it is in fact true. Specifying a level of 
significance 1 — p of the test amounts to the statement 
that its size does not exceed p. It is customary to specify 
p a priori and design the test accordingly. The important 
question if the surrogate data test has indeed the speci- 
fied size has been previously addressed, see Refs. |0,||,|o[. 
If the actual probability of a false rejection is larger than 
p, the test yields incorrect results. The above references 
give examples where this situation can occur with sur- 
rogate data tests. While excessive size renders the test 
useless, an actual size which is smaller than p is formally 
admissible. However, it can result in a dramatic decrease 
in discrimination power. In such cases (for example if a 
fitted linear model is run to generate surrogates), it is 
therefore advisable to calibrate the test by using "sur- 
rogate surrogate data" |10|. Since the size of the test 
may depend on the particular realization of the null hy- 
pothesis, this calibration is usually quite cumbersome. 
We verified the correct test size for all the numerical ex- 
amples in this paper by performing a series of tests on 
surrogate data fulfilling the null hypothesis. 

While the size predominantly assesses the quality of 
the surrogate data sets, we want to evaluate in this paper 
the abilities of different observables t to detect nonlinear- 
ity. This property is quantified by the power (3 of the 
test. It is defined as the probability to correctly reject 
the null hypothesis when it is indeed false. The power of 
a statistical test can be determined empirically by repeat- 
ing the test many times on different realizations of the 
data. Since we cannot make strong assumptions about 
the distributions of the observables, there is no alterna- 
tive to this computationally expensive approach. In order 
to limit the computational effort, however, we performed 
tests at a rather low level of significance, for which only 
few surrogate data sets are necessary. 



III. MEASURES OF NONLINEARITY 

We evaluated a number of different nonlinear observ- 
ables. Most of them are at least inspired by the the- 
ory of nonlinear dynamical systems and rely on a time 
delay embedding of the scalar time series. Embedding 
vectors in m dimensions are formed as usual: x„ = 
{xn-(m-i)Ti ■ ■ ■ T^n), where T is the delay time. Since 
the Grassberger-Procaccia correlation dimension D2 fill ] 
seems to be among the most popular measures we consid- 
ered several variants of this algorithm. The correlation 
sum C(e) at a scale e is given by 



C(e) = const. X ^ 6(||a?i 



(1) 



K-j|>t., 



Dynamically correlated pairs are discarded as usual and 
const, refers to the normalization. Since none of the ex- 
amples in this study would allow for the identification of 
a true scaling region, we will choose the length scales for 
good discrimination power. Of course, this will make an 
interpretation as a fractal dimension or complexity mea- 
sure impossible. In particular we implemented two ways 
of turning C(e) into a single number. 

1. A maximum likelihood (ML) estimator of the 
Grassberger-Procaccia correlation dimension D2 is 
given by: 



t^L(m,r,6) 



C„(e) 



/c 



e' 



de' 



(2) 



This expression is taken from Ref. [|T2|. Maximum 
likelihood estimation of the correlation dimension 
goes back to Ref. [|l3|. Therefore such quantities 
are generally referred to as Takens ' estimator. 

2. Brock et al. (BDS) jl^ have shown that for a se- 
quence of independent random numbers, Cm(e) = 
Ci(e)'" holds, where m is the embedding dimen- 
sion. In the same paper, also a formal test for this 
property is introduced. Instead of the original BDS 
statistic which has been introduced in order to be 
able to give the asymptotic form of the probability 
distribution, we use the simpler expression 



i«^^(m,T,e)=C™(e)/Ci(e)^ 



(3) 



Other choices we have tried are values of C(e) at fixed 
length scales, which gave consistently less power, and di- 
mension estimators based on pointwise dimensions. In 
the latter case, the scaling exponent of neighbor distances 
is determined for each point separately. The actual ob- 
servable is then the mean or the median of these val- 
ues 15|. Since we did not find any interesting devia- 
tions from the power of the maximum likelihood estima- 
tor t^^ we did not include detailed results in this paper. 

Many quantities which have been proposed in the lit- 
erature for nonlinearity testing in some way or the other 
quantify the nonlinear predictability of the signal. Ex- 
amples include the statistic proposed by Kaplan and 
Glass and to some extent also the false nearest neigh- 
bors techniques . We use a particularly stable repre- 
sentative of the class of predictability measures: 

3. A nonlinear prediction error with respect to a lo- 
cally constant predictor F can be defined by 



tP^(m,T,e)= (^[a:„+i-F(x„)]2) 



1/2 



(4) 



The prediction over one time step is performed by 
averaging over the future values of all neighboring 
delay vectors closer than e in m dimensions. 
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In Ref. jij] a nonlinear Volterra- Wiener model is claimed 
to be superior to other techniques when applied to short 
noisy signals. We have compared the maximal feasi- 
ble noise level for a detection of nonlinearity quoted in 
Ref. ||l8| to the performance of the locally constant pre- 
dictor above for the Henon, Ikeda, and Lorcnz series. 
We found that t^^ gave either better (Henon, Ikeda) or 
comparable (Lorenz) performance and therefore did not 
include the Volterra- Wiener model in this study. 

Further, we used the following nonlinear observables: 

4. Linear (two point) autocovariances can be gener- 
alized by introducing more than one lag. In the 
spectral domain, this generalization leads to the 
bispectrum and polyspectra [|9|. Our (somewhat 
arbitrary) choice of a higher order autocovariance 
(or cumulant) is 

t'^^{T) = {XnXn-rXn-2T) ■ (5) 

5. A simple quantity which is frequently used to detect 
deviations from timc-reversibility is 

i^E^(r) = ((a;„-x„_,)3). (6) 

We have explicitly indicated the adjustable parameters 
which can be chosen using several different strategies. 
One possibility is to optimize the adjustable parameters. 
This has to be done either on data which is not subse- 
quently used for the test, or it has to be done individually 
for each data set and surrogate. The former requires the 
knowledge of the correct answer for the "training data" 
which is rather uncommon. The latter is computationally 
extremely expensive and care has to be taken in order to 
avoid overfitting of the data. Note that for example min- 
imizing prediction errors does not necessarily optimize 
the discrimination power. 

In the present work, we fix as many parameters as pos- 
sible to reasonable ad hoc values prior to the tests. Before 
each test, a brief survey was performed as to which em- 
bedding dimensions and delay times lead to satisfactory 
results for each quantity. We feel that this procedure 
comes closest to what one can do in practice, where also 
a formal optimization of the discrimination power is im- 
possible. The length scale e was either determined as a 
fixed fraction (1/4) of the root mean squared (^, |[) or 
the peak-to-peak amplitude (||.) of the data. 

IV. IMPLEMENTATION AND RESULTS 

The surrogate data sets will be generated as described 
in Ref. 0, which is the appropriate method when the 
null hypothesis is that the data has been generated 
by a Gaussian linear stochastic process, possibly mea- 
sured through a monotonic, instantaneous, time inde- 
pendent measurement function. In brief, the method is 



based on an ordinary phase randomized surrogate series 
S — {smn — 1, . . . , N} which has the same sample power 
spectrum as the time series X = {x„,n = l,...,N}. 
Such a surrogate is obtained by taking the Fourier trans- 
form of X, randomizing the phases, and inverting the 
transform. Now the following two steps are iterated al- 
ternatingly: 

1. The surrogate series is brought to the sample dis- 
tribution of X by rank-ordering: 

Here, rank(s„) = k and index(fc) = n if s„ is the 
fc-th smallest value in S. After this step, S' and X 
have the same distribution of values, but the power 
spectrum may have changed. 

2. The Fourier amplitudes of S' = {s'^, n — 1, . . . , N} 
are replaced by those of X. The resulting series S" 
has the same sample power spectrum as X. This 
step may however alter the distribution of values. 

In Ref. numerical evidence and heuristic arguments 
are given that this scheme indeed converges to a sequence 
with the same distribution and the same power spectrum 
as the data. While formal convergence can only be ex- 
pected for infinitely long sequences, the approximation is 
satisfactory for finite data length. If the deviation from 
a Gaussian distribution or the linear correlations in the 
time series are not too strong, the usual amplitude ad- 
justed phase randomized surrogates jsj yield an accurate 
test as well. Our results do not explicitly depend on the 
particular method of generating constrained Monte Carlo 
realizations. 

As mentioned before, we do not know the probability 
distributions of the nonlinear observables used in this 
paper. In particular, Gaussianity cannot be assumed. 
Therefore we have to employ a non-parametric, rank- 
based test, as it has been suggested in Ref. A test 
is called one-sided if the null hypothesis is rejected only 
if the data deviates from the surrogates in a specified 
direction. In this case and at a given size a, we create 
B = 1/a — 1 surrogate data sets and compute the test 
statistic to on the original data set and its value tk,k = 
1, . . . , i? on each of the surrogates. Since we have a total 
of 1/a sets, the probability for each of them to have the 
smallest value of t by chance is just a, as desired. For 
two-sided tests, we generate B — 2/a — 1 surrogates. 
The probability for any of the 2/a sets to have either the 
smallest or the largest value of t is then again a. 

For the nonlinearity measures inspired by the theory 
of deterministic dynamical systems (0.~|^. above) , we ex- 
pect nonlinearity in the data to result in lower values. 
Thus it is natural to perform one-sided tests. For the 
remaining two measures we perform two-sided tests. In 
order to limit the computational burden, all tests are 
carried out at the 90% level of significance, that is, with 
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statistic 


parameters 


feasible 


noise level Omax 






/3 = 0.95 


/3 = 0.7 




m = 2 


0.7 


0.9 


^BDS 


m = 3 


1.1 


1.3 




m = 3 


1.2 


1.5 


^C3 


r = 1 


1.1 


1.5 


jREV 


r = 1 


1.4 


1.8 



TABLE I. Maximal feasible noise level for the detection 
of nonlinearity with (3 — 0.95 resp. (3 = 0.7. Results for the 
Henon map. 



9 (resp. 19 for two-sided tests) surrogates. For practi- 
cal applications, at least a 95% confidence is usually re- 
quired. The power can be increased by performing tests 
based on more than the minimal number of surrogate 
data sets. 

For purely deterministic signals, we would almost in- 
variably get a discrimination power of /9 = 1. Therefore 
we contaminate deterministic sequences {xn} with noise 
{Vn} which consists of a phase randomized copy of the 
sequence. Thus the noise is random but with the same 
power spectrum as the data {in-hand noise). The noisy 
data is given by: 



1 



1 



:{xn + arjn). 



(8) 



The way the noise is generated and added guarantees 
that the power is not dominated by changes in the auto- 
correlations or the variance of the data. 

One sequence of tests is performed at different noise 
levels in order to determine the maximal feasible noise 
level which allows for the detection of nonlinearity with a 
power of (3 = 0.95 resp. (3 = 0.7. The practical usefulness 
of tests with power less than [3 = 0.7 seems questionable. 
In this sequence, 2000 individual tests with Henon time 
series of length 2048 were carried out for eachpoint. The 
results are summarized in Table | and Fig. ^ For this 
discrete time system, unit time delay seems most appro- 
priate. 

Further, we evaluated the different quantities for a 
number of particular data problems, time series from the 
Lorenz equations, an NMR laser experiment, and an as- 
sembly of uncoupled tent maps. In Table |l| we show 
the results for time series of the Lorenz system at stan- 
dard parameter values. 2048 samples of the x-coordinate 
were recorded every 0.08 time units. Noise of amplitude 
a = 1.3 was added. It was checked for each of the differ- 
ent observables (but with fewer tests) that other choices 
of the lag time and the embedding dimension did not lead 
to significantly better results. 

A long experimental time series from an NMR laser 
experiment [20| was split into 600 segments with 1000 
points each. In-band noise of amplitude a = 0.8 was 
added. Results are shown in Table III . 
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FIG. 1. Comparison of discrimination power for different 
nonlinearity measures and noise levels for Henon data with 
in-band noise. Curves from the left: correlation statistics t 
and t^^^ , prediction error t^^ (crosses), third order cumulant 
t'^'"^, time asymmetry t^^^ . The size of the test was taken to 
be 0.1. 





statistic 


parameters 


power /3 




m = 3 


0.25 ± 0.02 


^BDS 


m = 2 


0.24 ± 0.02 




m = 4 


0.66 ± 0.02 




r = 3 


0.09 ± 0.01 


^REV 


r = 3 


0.10 ± 0.01 




TABLE II 


Fraction of successful rejections out of 1000 


tests, noisy 


Lorenz data. The errors are 


based on the as- 


sumption of 


a binomial distribution for independent trials. 


No significant rejection is possible with t'^^ 


and t^^^. 




statistic 


parameters 


power /3 




m = 2) 


0.61 ± 0.03 


^BDS 


m = 3 


0.86 ± 0.02 




m = 3 


0.79 ± 0.02 


^C3 


r = 3 


0.45 ± 0.03 


^REV 




0.35 ± 0.02 



TABLE III. Fraction of successful rejections out of 600 
tests, noisy NMR laser data. 
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statistic 


parameters 


power P 




m - 


= 2 


0.11 ± 0.03 


^BDS 


m - 


= 2 


0.10 ± 0.03 




m - 


= 4 


0.92 ± 0.03 




T — 


1 


0.10 ± 0.03 


^REV 


T — 


1 


1.00 ± 0.00 



TABLE IV. Fraction of successful rejections out of 100 
tests, sum of 16 uncoupled tent maps. 
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FIG. 2. Discrimination power for uncoupled tent maps. In 
this figure, results are shown for three selected nonlinearity 
measures, from above: time asymmetry t^^^, prediction error 
t^^,m = 4, and ML dimension estimator t^^,m = 2. The 
number of maps was varied in steps of two, each point was 
obtained with 200 tests. The size of the test was taken to be 
0.1. 



Finally, we consider an assembly of uncoupled tent 
maps. Each individual map is given by Xn+i = 2a;„ if 
X < 0.5 and x„+i — 2 — 2x„ if x > 0.5. The recorded 
variable is the sum of the variables of N individual tent 
maps. No noise is added. The discrimination power is 
measured as a function of iV. In Fig. |^ we show the re- 
sults for the time asymmetry, prediction error, and ML 
statistics. Table IV shows the results for all the nonlin- 
earity measures at iV = 16. In this example, the time 
asymmetry statistic is doing extremely well. The pre- 



diction error also gives reasonable power while all other 
quantities basically fail, although different settings for m 
were considered. 



V. CONCLUSIONS 

The results presented in Tables and the figures 

suggest that the root mean squared error of a simple 
nonlinear predictor gives consistently good discrimina- 
tion power. Other nonlinearity measures give even better 
performance in some cases, but fail in others. In partic- 
ular, the time reversal asymmetry does very well most 
of the time but can also fail completely. Asymmetry un- 



der time reversal is a sufficient and powerful indicator 
of nonlinearity, but not a necessary condition. Which 
algorithm is to be preferred in a particular situation de- 
pends on the availability of an independent check for the 
discrimination power. In the typical situation that only 
few precious data sets, or even just one recording (like in 
long-term geophysical observations) is available, it seems 
advisable to use a robust, general purpose statistic with 
few adjustable parameters, for example a simple predic- 
tion error. If asymmetry under time reversal appears un- 
der visual inspection of the data, a simple statistic like 
^REV ^^Yl probably give best results. 

The null hypothesis we have adopted in this work has 
been chosen since it is the most general one that ex- 
cludes nonlinear determinism and that can be tested for 
properly. If we are in fact looking for deterministic struc- 
ture in a signal, then simple statistics like t^^^ and t'~^^ 
which are based on higher order cumulants are not very 
attractive because they are quite sensitive also to those 
deviations from the null hypothesis we are not looking 
for. The formal test discussed in this paper answers the 
question if any deviation from a (rescaled) Gaussian lin- 
ear stochastic process can be detected. Surrogate data 
tests have however been mostly used with the question 
in mind if it is legitimate and useful to use methods from 
dynamical systems theory. This amounts to specifying 
a particular class as an alternative hypothesis. In such 
a case we should choose the discriminating statistic ac- 
cordingly, that is from the arsenal of dynamical systems 
methods. 

Let us finally remark that a couple of tests for non- 
linear properties of time series have been proposed which 
use surrogate data in a different way or not at all. Rather 
than estimating the distribution of the observable t from 
a randomized sample, it is sometimes calculated on the 
base of some assumptions. If the null hypothesis is that 
of a purely Gaussian linear random process (without dis- 
tortion), significance levels for higher order correlation 
functions can be derived. Some authors, e.g. Ref. 
observe that most observables t are temporal averages 
over individual quantities t„ determined for each point 
in a time sequence. In order to derive the distribution of 
t from the knowledge of {tn} however, one has to make 



certain assumptions. Ref. 
independence of the i„ whi 



d assumes Gaussianity and 
e Ref. only needs inde- 



pendence. We do not see what should justify the assump- 
tion of point-to-point independence of {tn} for autocor- 
related time series data, indeed we empirically find the 
assumption to be wrong at least for prediction errors and 
pointwise dimensions. The common positive correlation 
among the t„ leads to an underestimation of the variance 
of the average t and thus to a dangerous overestimation 
of the significance of the test. 
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